Combining Orientational Pooling Features for Scene Recognition
Abstract
Scene recognition is a basic task in image understanding. Spatial Pyramid Matching (SPM) has been shown to be an efficient solution for spatial context modeling. In this paper, we introduce an alternative approach, Orientational Pyramid Matching (OPM), for orientational context modeling. Our approach is motivated by the observation that the 3D orientations of objects are a crucial factor in discriminating indoor scenes. The novelty is that OPM uses 3D orientations to form the pyramid and produce the pooling regions, whereas SPM uses spatial positions. Experimental results on challenging scene classification tasks show that OPM achieves performance comparable to SPM, and that OPM and SPM make complementary contributions, so that their combination gives state-of-the-art performance.

1. The Bag-of-Features Model

The BoF model is composed of three basic stages: local descriptor extraction, feature encoding, and spatial pooling. The local descriptor extraction stage usually extracts a set of local descriptors, e.g., SIFT [8] or HOG [2], from the interest points or densely sampled patches of an image. The feature encoding stage then assigns each descriptor to the closest entry in a visual vocabulary, i.e., a codebook learned offline by clustering a large set of descriptors with K-Means or a Gaussian Mixture Model (GMM). Feature encoding can also be sparse [13] or high-dimensional [9]. Spatial pooling consists of partitioning an image into a set of regions, aggregating feature-level statistics over these regions [18], and normalizing and then concatenating the region descriptors into an image-level feature vector [16]. The image partition can be obtained by Spatial Pyramid Matching (SPM) [7]. Aggregation of the descriptors within a region is often performed with a pooling strategy.

2. Our Approach

In this section, we first introduce the proposed Orientational Pyramid Matching model, and then present the algorithm for estimating the 3D orientations of image patches.

2.1. Orientational Pyramid Matching

Given a set of patch descriptors extracted from interest points or densely sampled regions, the goal is to summarize them into an image-level feature vector. Unlike Spatial Pyramid Matching (SPM), in which each patch descriptor is associated with its spatial position, our approach augments each patch descriptor $f$ with an additional 3D orientation, denoted by the azimuth and polar angles $o = (\theta, \phi)$. We denote the set of encoded local features as $S = \{(f_1, o_1), (f_2, o_2), \ldots, (f_M, o_M)\}$.

The proposed Orientational Pyramid Matching (OPM) algorithm starts by partitioning the set $S$ into subsets $\{S_t\}$, $t = 1, 2, \ldots, T_O$, where each subset consists of patch descriptors that are close in orientational angle, rather than in spatial position as in SPM. The partition can be done in various ways, such as clustering the angles. In this paper, we follow a simple scheme similar to SPM and use a regular partition, i.e., we divide the orientational space $U = [-\frac{\pi}{2}, \frac{\pi}{2}]^2$ into regular grids, which is shown to perform well in practice. Let $L_A$ and $L_P$ be the numbers of pyramid layers along the azimuth and polar angles, respectively. A bin in the $l$-th layer then has size $\frac{\pi}{2^{\min\{l, L_A\}}} \times \frac{\pi}{2^{\min\{l, L_P\}}}$ along the azimuth and polar angles, i.e., the number of orientational pooling bins in the $l$-th layer is $2^{\min\{l, L_A\}} \times 2^{\min\{l, L_P\}}$.

Denote the set of partitions produced by the orientational pyramid by $R_1, R_2, \ldots, R_{T_O}$. Each region $R_t$ contains a set of $M_t$ patch descriptors $\{f_{t,1}, f_{t,2}, \ldots, f_{t,M_t}\}$. We aggregate the $M_t$ features to generate a descriptor $f_t$ for region $R_t$. The overall image feature is then obtained by concatenating the pooled feature vectors of all the regions.
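The regular orientational partition and pooling described in Section 2.1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. Several choices are assumptions the text does not fix: the function name `opm_pool` is hypothetical, layers are indexed from 0 up to the larger of the two layer counts, max pooling is used as the aggregation, empty bins are filled with zeros, and the per-region normalization step is omitted.

```python
import numpy as np

def opm_pool(features, orientations, layers_azimuth, layers_polar):
    """Pool encoded patch features over a regular orientational pyramid.

    features:     (M, D) array of encoded patch descriptors f_i
    orientations: (M, 2) array of (azimuth, polar) angles in [-pi/2, pi/2]
    Returns the concatenation of one pooled D-vector per orientational bin.
    """
    features = np.asarray(features, dtype=float)
    orientations = np.asarray(orientations, dtype=float)
    low, width = -np.pi / 2, np.pi  # the orientational space U spans pi per axis
    pooled = []
    # Assumption: layers run l = 0, 1, ..., max(L_A, L_P).
    for layer in range(max(layers_azimuth, layers_polar) + 1):
        n_az = 2 ** min(layer, layers_azimuth)  # bins along the azimuth axis
        n_po = 2 ** min(layer, layers_polar)    # bins along the polar axis
        # Map each angle to a bin index; clip so +pi/2 falls in the last bin.
        az = np.floor((orientations[:, 0] - low) / width * n_az).astype(int)
        po = np.floor((orientations[:, 1] - low) / width * n_po).astype(int)
        az = np.clip(az, 0, n_az - 1)
        po = np.clip(po, 0, n_po - 1)
        for a in range(n_az):
            for p in range(n_po):
                mask = (az == a) & (po == p)
                if mask.any():
                    # Assumption: max pooling over the features in this bin.
                    pooled.append(features[mask].max(axis=0))
                else:
                    # Assumption: an empty bin contributes a zero vector.
                    pooled.append(np.zeros(features.shape[1]))
    return np.concatenate(pooled)

# Toy usage: four 3-dimensional features, one extra layer on each axis.
toy_features = np.array([[1.0, 0.0, 0.0],
                         [0.0, 2.0, 0.0],
                         [0.0, 0.0, 3.0],
                         [4.0, 0.0, 0.0]])
toy_orientations = np.array([[-1.0, -1.0],
                             [1.0, 1.0],
                             [-1.0, 1.0],
                             [-1.0, -1.0]])
vec = opm_pool(toy_features, toy_orientations, layers_azimuth=1, layers_polar=1)
print(vec.shape)  # (15,): (1 + 4) bins times 3 dimensions
```

Replacing the angle pairs with normalized image coordinates would give the analogous SPM partition, so under these assumptions the same pooling routine could serve both pyramids before their outputs are combined.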
Similar resources
Scene Aligned Pooling for Complex Video Recognition
Real-world videos often contain dynamic backgrounds and evolving people activities, especially for those web videos generated by users in unconstrained scenarios. This paper proposes a new visual representation, namely scene aligned pooling, for the task of event recognition in complex videos. Based on the observation that a video clip is often composed of shots of different scenes, the key i...
Scene Recognition by Combining Local and Global Image Descriptors
Object recognition is an important problem in computer vision, having diverse applications. In this work, we construct an end-to-end scene recognition pipeline consisting of feature extraction, encoding, pooling and classification. Our approach simultaneously utilizes global feature descriptors as well as local feature descriptors from images, to form a hybrid feature descriptor corresponding to...
Learning Hybrid Part Filters for Scene Recognition
This paper introduces a new image representation for scene recognition, where an image is described based on the response maps of object part filters. The part filters are learned from existing datasets with object location annotations, using deformable part-based models trained by latent SVM [1]. Since different objects may contain similar parts, we describe a method that uses a semantic hiera...
Multiple spatial pooling for visual object recognition
Global spatial structure is an important factor for visual object recognition but has not attracted sufficient attention in recent studies. Especially, the problems of features' ambiguity and sensitivity to location change in the image space are not yet well solved. In this paper, we propose multiple spatial pooling (MSP) to address these problems. MSP models global spatial structure with multi...
Fornoni, Caputo: Indoor Scene Recognition using Task and Saliency-driven Feature Pooling
Indoor scenes are characterized by a high intra-class variability, mainly due to the intrinsic variety of the objects in them, and to the drastic image variations due to (even small) view-point changes. One of the main trends in the literature has been to employ representations coupling statistical characterizations of the image, with a description of their spatial distribution. This is usually...